Abstract: The ability to efficiently switch from one pre-encoded
video stream to another (e.g., for bitrate adaptation
or view switching) is important for many interactive streaming 
applications. Recently, stream-switching mechanisms based on 
distributed source coding (DSC) have been proposed. In order to 
reduce the overall transmission rate, these approaches provide 
a “merge” mechanism, where information is sent to the decoder 
such that the exact same frame can be reconstructed given that 
any one of a known set of side information (SI) frames is available 
at the decoder (e.g., each SI frame may correspond to a different 
stream from which we are switching). However, the use of bit- 
plane coding and channel coding in many DSC approaches leads 
to complex coding and decoding. In this paper, we propose an 
alternative approach for merging multiple SI frames, using a 
piecewise constant (PWC) function as the merge operator. In 
our approach, for each block to be reconstructed, a series of 
parameters of these PWC merge functions are transmitted in 
order to guarantee identical reconstruction given the known side 
information blocks. We consider two different scenarios. In the 
first case, a target frame is first given, and then merge parameters 
are chosen so that this frame can be reconstructed exactly at the 
decoder. In contrast, in the second scenario, the reconstructed 
frame and merge parameters are jointly optimized to meet a
rate-distortion criterion. Experiments show that for both scenarios,
our proposed merge techniques can outperform both a recent 
approach based on DSC and the SP-frame approach in H.264, 
in terms of compression efficiency and decoder complexity. 

I. Introduction 

In conventional non-interactive video streaming, a client 
plays back successive frames in a pre-encoded stream in a 
fixed order. In contrast, in interactive video streaming [1], a 
client can switch freely in real-time among a number of pre-encoded
streams. Examples include switching among multiple
streams representing the same video encoded at different bit- 
rates for real-time bandwidth adaptation [2], or switching 
among views in a multi-view video [3]. See [1] for more examples
of interactive streaming. A major challenge in interactive
video streaming is to achieve efficient real-time switching 
among pre-encoded video streams. A simple approach would 
be to insert an intra-coded I-frame at each potential switching 
point [4]. But the relatively high rate required for I-frames 
often makes it impractical to insert them frequently in the 
streams, thus reducing the interactivity of playback. 
Fig. 1. Given the k-th coefficient $X_b^n(k)$ in block b from either SI frame 1
or 2, a piecewise constant function $f(x)$ maps either one ($X_b^1(k)$ or
$X_b^2(k)$) to the same $X_b(k)$ if they fall on the same constant interval.


Towards a more efficient stream-switching mechanism, distributed
source coding (DSC) has been proposed. DSC can
in principle achieve compression efficiency that is a function
of the worst-case correlation between the target frame and
the side information (SI) frames (from which the client may
be switching) [5-7]. As an example, illustrated by Fig. 1, in
the block-based DCT approach of [7], a desired k-th quantized
frequency coefficient value $X_b(k)$ in block b of the target frame
is reconstructed using either $X_b^1(k)$ or $X_b^2(k)$, the corresponding
coefficients in SI frames 1 and 2, respectively. A D-frame is
transmitted so that it is possible to reconstruct the exact same
target frame given any one of the two SI frames [7]. Thus we
say that the D-frame supports a merge operation. In particular,
the least significant bits (LSBs) of $X_b^1(k)$ and $X_b^2(k)$ are treated
as “noisy” versions of the LSBs of $X_b(k)$. The most significant
bits (MSBs) of $X_b(k)$ are obtained from the MSBs of $X_b^1(k)$ or
$X_b^2(k)$, which are identical, while the D-frame contains channel
codes that can produce the actual LSBs of $X_b(k)$ taking $X_b^1(k)$
or $X_b^2(k)$ as inputs. The channel codes associated with these target
frame coefficients compose the D-frames, which potentially
require significantly fewer bits than an I-frame representation
of the target frame [7].

There remain significant hurdles towards practical implementation
of D-frames, however. First, the use of bit-plane
encoding and channel codes in proposed techniques [7] means
that the computational complexity at the decoder is high. Second,
because the average statistics of a transform coefficient
bit-plane for the entire image are used, non-stationary noise
statistics can lead to high-rate channel codes, resulting in
coding inefficiency.

In this paper, we propose to use a piecewise constant (PWC)
function* as the signal merging operator. This approach operates
directly on quantized frequency coefficients (instead of
using a bit-plane representation) and does not require channel
codes. As will be discussed in more detail in Section VI-C, our
signal merging approach can be interpreted as a generalization
of coset coding [9], where we explicitly optimize the merged
target values for improved rate-distortion (RD) performance.
The basic idea of our approach is summarized in Fig. 1, which
depicts a floor function characterized by two parameters: a
step size $W$ and a shift $c$. In our approach, the encoder selects
$W$ and $c$ to guarantee that $X_b^1(k)$ and $X_b^2(k)$ are in the same
interval and thus map to the same reconstruction value. A $W$
will be chosen for each frequency $k$, based on the statistics of
the various $X_b^n(k)$ across all blocks $b$. Then, given $W$, it will be
possible to adjust $c$ so that the reconstructed value matches a
desired target, $X_b(k)$. A value of $c$ will be chosen for each $k$
and $b$, so that the bitrate required by our proposed technique
is dominated by the cost of transmitting $c$. In this paper, we
will formulate the problem of selecting $c$ and $W$, and develop
techniques for RD optimization of this selection.

We consider two scenarios. In the first one, fixed target
merging, we will assume that $X_b(k)$ has been given, e.g., by
first generating an intra-coded version of the target frame, and
using the corresponding quantized coefficient values as targets.
We will show how to choose $W$ to guarantee that $X_b(k)$ can
be reconstructed. We will also show that given $W$, $c$ is fixed.
This type of merging is useful when there are cycles in the
interactive playback, i.e., frame A is an SI frame for frame B
and B is an SI frame for A. This will be the case in static
view switching for multiview video streaming, to be discussed
in Section III.

In the second scenario, optimized target merging, we select
$W$, $c$ and $X_b(k)$ based on an RD criterion, where distortion
is computed with respect to a desired target $X_b^0(k)$. In this
scenario, we can use smaller values for $W$, and no longer
need to select a fixed $c$ for a given $W$ and $X_b(k)$. This allows
us to optimize $c$ so as to significantly reduce the rate needed
to encode the merging information. This approach can be used
when there are no cycles in the interactive playback, e.g., in
dynamic view switching scenarios (also discussed in Section
III). Experimental results show significant compression gains
over D-frames [7] and SP-frames in H.264 [10] at reduced
decoder computational complexity.

The paper is organized as follows. We first summarize 
related work in Section II. We then provide an overview of 
our coding system in Section III. We discuss the use of PWC 
functions for signal merging in Section IV. We present our 
PWC function parameter selection methods for fixed target 
merging and optimized target merging in Section V and 
VI, respectively. Finally, we present experimental results and 
conclusions in Section VII and VIII, respectively. 

II. Related Work 

The H.264 video coding standard [11] introduced the concept
of SP-frames [10] for stream-switching. In a nutshell, first
the difference between one SI frame and the target picture is 

*An earlier version of this paper was presented at ICIP 2013 [8].


lossily coded as the primary SP-frame. Then, the difference between
each additional SI frame and the reconstructed primary
SP-frame is losslessly coded as a secondary SP-frame; lossless 
coding ensures identical reconstruction between primary and 
each of the secondary SP-frames. One drawback of SP-frames 
is coding inefficiency. Due to lossless coding in secondary SP- 
frames, their sizes can be significantly larger than conventional 
P-frames. Furthermore, the number of secondary SP-frames 
required is equal to the number of SI frames, thus resulting 
in significant storage costs. As we will discuss, our proposed 
scheme encodes only one merge frame for all SI frames, and 
hence the storage requirement is lower than for SP-frames. 

While DSC has been proposed for designing interactive and
stream-switching mechanisms in the past decade [2,5-7,12],
it has neither been widely used nor adopted into any video
coding standard, partly due to the computational complexity
required for bit-plane and channel coding in common DSC
implementations. In contrast, in this work, our proposed coding tool
involves only quantization (PWC function) and entropy coding 
of function parameters, both of which are computationally 
simple. Further, we demonstrate coding gain over a previously 
proposed DSC-based approach [7] in Section VII. 

One of the primary applications of our proposed merge 
frame is interactive media systems, which have attracted 
considerable interest [13]. In particular, a range of media data 
types have been considered for interactive applications in the 
past: images [14], light-fields [15,16], volumetric images [17], 
videos [5,6,18-22] and high-resolution videos [23-26]. While 
it is conceivable that our proposed merge frame can be applicable
in some of these use scenarios for which DSC techniques
have been proposed, here we focus on real-time switching 
among multiple pre-encoded video streams, as discussed in 
Section III. 

This paper extends our earlier work [8], by providing a more 
detailed presentation and evaluation of the system, as well 
as introducing two new concepts. First, we study the fixed 
target merging case (Section V). Second, for the optimized 
target merging case, we develop a new algorithm to compute
a locally optimal probability function $P(c)$ for shift $c$, one that
leads to more efficient entropy coding of $c$ and small signal
reconstruction distortion after merging (Section VI). We will
show in our experiments, described in Section VII, that our 
new algorithm leads to significantly better RD performance 
than our previously published work [8]. 

III. System Overview 
A. IVSS System Overview 

We provide an overview of our proposed coding system 
for interactive video stream switching (IVSS), in which our 
proposed merge frame is a key enabling component. In the 
sequel, a “picture” is a raw captured image in a video 
sequence, while a “frame” is a particular coded version of the 
picture (e.g., I-frame, P-frame). In this terminology, a “picture” 
can have multiple coded versions or “frames”. 

In an IVSS system, there are multiple pre-encoded video
streams that are related (e.g., videos capturing the same 3D
scene from different viewpoints [3]). During video playback
of a single stream, at a switch instant, the client can switch
from a picture of the original stream to a picture of a different
destination stream. Fig. 2 illustrates an example picture
interactivity graph for three streams, where there is a switch instant
every two pictures in time. An arrow $\Pi_p \to \Pi_q$ indicates
that a switch is possible from picture $\Pi_p$ to picture $\Pi_q$. This
particular graph is acyclic, i.e., it has no loops and we cannot
have both $\Pi_p \to \Pi_q$ and $\Pi_q \to \Pi_p$.

Fig. 2. Example of an acyclic picture interactivity graph for dynamic view
switching. Each picture $\Pi_{v,t}$ has subscript indicating its view index v and
time instant t. After viewing picture $\Pi_{2,1}$ of stream 2, the client can choose
to keep watching the same stream and jump to $\Pi_{2,2}$, or switch to $\Pi_{1,2}$ or
$\Pi_{3,2}$ of stream 1 and 3, respectively.

Fig. 3. Example of a cyclic picture interactivity graph for static view
switching. Each picture $\Pi_{v,t}$ has subscript indicating its view index v and
time instant t. After viewing $\Pi_{2,2}$ of stream 2, the client can choose to keep
watching stream 2 in time and jump to $\Pi_{2,3}$, or change to $\Pi_{1,2}$ or $\Pi_{3,2}$ of
stream 1 and 3, respectively, corresponding to the same time instant as $\Pi_{2,2}$.

The scenario in Fig. 2 is an example of dynamic view
switching [27], where a frame at time t is always followed
by a frame at time t + 1. In contrast, in static view switching
a user can stop temporal playback and interactively select the
angle from which to observe a 3D scene frozen in time [28].
Fig. 3 shows an example of static view switching, where the
corresponding graph is cyclic, i.e., it contains loops so that we
can have both $\Pi_p \to \Pi_q$ and $\Pi_q \to \Pi_p$. We will discuss the
merge frame design for the cyclic case in Section V.

B. Stream-Switch Mechanism in IVSS

At a given switch instant, stream switching works as follows.
First, for each possible switch $\Pi_p \to \Pi_q$, we encode a
P-frame $P_{q|p}$ for $\Pi_q$, where a decoded version of $\Pi_p$ is used as
a predictor. Reconstructed $P_{q|p}$ is called a side information (SI)
frame, which constitutes a particular reconstruction of destination
$\Pi_q$. Because there are in general multiple origins for a
given destination (the in-degree for destination picture in the
picture interactivity graph), there are multiple corresponding
SI frames. Having multiple reconstructions of the same picture
creates a problem for the following frame(s) that use $\Pi_q$ as
a predictor for predictive coding, because one does not know
a priori which reconstructed SI frame $P_{q|p}$ will be available
at the decoder buffer for prediction. This illustrates the need
for our proposed merge frame (called M-frame in the sequel)
$M_q$, which is an extra frame corresponding to destination $\Pi_q$.
Correct decoding of $M_q$ means a unique reconstruction of $\Pi_q$,
no matter which SI frame $P_{q|p}$ is actually available at the
decoder.


Fig. 4. Example of stream-switching from one pre-encoded stream to
another using merge frame. The two SI frames are first constructed using
predictors $P_{1,2}$ and $P_{2,2}$, respectively. M-frame $M_{1,3}$ is encoded using the
two SI frames. I-, P- and M-frames are represented as circles, squares and
diamonds, respectively.


As an illustration, in Fig. 4 two P-frames, generated from
predictors $P_{1,2}$ and $P_{2,2}$ respectively, are the SI
frames. An M-frame $M_{1,3}$ is added to merge the SI frames to
produce an identical reconstruction for $\Pi_{1,3}$. During a stream-switch,
the server can transmit any one of the two SI frames
and $M_{1,3}$, leading to the same reconstructed frame for $\Pi_{1,3}$,
thus avoiding coding drift in the following frame $P_{1,4}$. Note
that one P-frame and one M-frame are sent. An alternative
approach based on SP-frames would require sending a primary
SP-frame (using $P_{1,2}$ as the predictor) for the switch
$\Pi_{1,2} \to \Pi_{1,3}$, or a losslessly coded secondary SP-frame
(using $P_{2,2}$ as the predictor) for the switch $\Pi_{2,2} \to \Pi_{1,3}$.
SP-frame approaches are asymmetric; rate is much lower when
only a primary SP-frame is needed. In contrast, the switching
cost using an M-frame is always the same (P- and M-frames
are transmitted). As will be shown, a combination of a P-frame
and an M-frame requires lower rate than a secondary
SP-frame.

C. Merge Frame Overview 

In our proposed M-frame, each fixed-size code block in
an SI frame is first transformed to the DCT domain. DCT
coefficients are then quantized. The quantized coefficients 
across SI frames (called q-coeffs for short in the sequel) are 
then examined. If the q-coeffs of a given block are very 
different across SI frames, then the overhead to merge their 
differences to targeted q-coeffs would be large. Thus, we will 
encode the block as a conventional intra block. On the other 
hand, if the q-coeffs of a given block are already identical 
across all SI frames, then we can simply inform the decoder 
that the q-coeffs can be used without further processing. 
Finally, if the q-coeffs across SI frames are not identical but
are similar, then each q-coeff is merged identically to a
target value via our proposed merge operator. Hence, together
there are three coding modes for each code block: intra, skip 
and merge. In this paper, we focus our attention on optimizing 
TABLE I
Table of Notations

Notation | Description
$N$ | number of SI frames
$S^n$ | SI frame n
$T$ | desired target frame
$M$ | M-frame
$R(M)$ | rate of M-frame $M$
$D(T, T(M))$ | distortion of reconstructed $M$ with respect to $T$
$\lambda$ | weight parameter to trade off distortion with rate
$\mathcal{B}_M$ | block group encoded in merge mode
$K$ | number of pixels in a code block
$x_b^n$ | block b of SI frame $S^n$
$Y_b^n(k)$ | k-th DCT coefficient of block b of SI frame $S^n$
$X_b^n(k)$ | k-th q-coeff of block b of SI frame $S^n$
$Q$ | quantization step size
$X_b(k)$ | k-th reconstructed q-coeff of block b
$Z_b^*(k)$ | max. pair difference between any pair of $X_b^n(k)$
$Z^*_{\mathcal{B}_M}(k)$ | group-wise max. pair difference, i.e., $\max_b Z_b^*(k)$
$W_{\mathcal{B}_M}(k)$ | step size for k-th q-coeff of block group $\mathcal{B}_M$
$c_b(k)$ | shift parameter for k-th q-coeff of block b
$\mathcal{F}_b(k)$ | feasible range of shift for identical merging
$Z_b(k)$ | max. target diff. between target $X_b^0(k)$ and any $X_b^n(k)$
$Z_{\mathcal{B}_M}(k)$ | group-wise max. target difference, i.e., $\max_b Z_b(k)$
$W^*_{\mathcal{B}_M}(k)$ | step size for k-th q-coeff for fixed target merging

the parameters in merge mode as the intra and skip modes are
straightforward.
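
The three-way mode decision described above can be sketched in a few lines. This is only an illustration, not the mode-selection rule used in this work: the function name `choose_block_mode` and the scalar `intra_threshold` are hypothetical, since the text does not specify how "very different" q-coeffs are detected.

```python
def choose_block_mode(si_blocks, intra_threshold):
    """Pick a coding mode for one code block.

    si_blocks: one list of q-coeffs per SI frame (same block, same length).
    intra_threshold: hypothetical cutoff on the largest per-frequency spread.
    """
    # Largest difference, over all frequencies, between any two SI frames.
    spread = max(max(col) - min(col) for col in zip(*si_blocks))
    if spread == 0:
        return "skip"    # q-coeffs already identical across SI frames
    if spread > intra_threshold:
        return "intra"   # merging overhead assumed too large
    return "merge"       # similar but not identical: use the merge operator


modes = [
    choose_block_mode([[3, 0, 1], [3, 0, 1]], intra_threshold=8),
    choose_block_mode([[3, 0, 1], [4, 0, 0]], intra_threshold=8),
    choose_block_mode([[3, 0, 1], [40, 0, 0]], intra_threshold=8),
]
```

In an actual encoder the intra/merge choice would more plausibly be made by comparing RD costs rather than by a fixed threshold.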


IV. Problem Formulation 


A. Notation 

We first define the notation that will be used in the sequel;
see Table I for quick reference. We denote the $N$ SI frames
by $S^1, \ldots, S^N$, one of which is guaranteed to be available at
the decoder buffer when M-frame $M$ is decoded. We denote
a desired target picture by $T$ and for notational convenience
we will include it in the set of SI frames as $S^0 = T$.

We denote the group of fixed-size code blocks in $M$ that
are encoded in merge mode by $\mathcal{B}_M$. Each block has $K$ pixels.
We denote by $x_b^n$ the b-th block in SI frame $S^n$ coded in
merge mode. Each block $x_b^n$ is transformed into the DCT
domain as $Y_b^n = [Y_b^n(0), \ldots, Y_b^n(K-1)]$, where $Y_b^n(k)$ is the k-th
DCT coefficient of $x_b^n$. We denote by $X_b^n(k)$ the k-th quantized
coefficient (q-coeff) given uniform quantization step size $Q$:

$$X_b^n(k) = \mathrm{round}\left( \frac{Y_b^n(k)}{Q} \right) \tag{1}$$

where round(x) is the standard rounding operation to the
nearest integer.


B. Formulation 

We consider two different problems based on the reconstruction
requirement with respect to the desired target $T$.
One typically chooses T a priori, e.g., by encoding the target 
picture independently (intra only) and using the decoded 
version as T. The first problem requires the M-frame to 
reconstruct identically to desired target T: 

Problem 1. Fixed Target Merging (Section V). Find M-frame
$M$ such that the decoder, taking as input any one of the SI
frames $S^n$ and $M$, can reconstruct $T$ identically as output.

Because of the differences between SI frames $S^n$ and desired
target $T$, there may be situations where a high rate is required
for $M$ (e.g., due to motion in the video sequence, the target
frame is very different from previously transmitted frames).
In this case, we allow the reconstruction to deviate from
desired target $T$ in order to reduce the rate required for $M$
by optimizing a rate-distortion criterion:

Problem 2. Optimized Target Merging (Section VI). Find
$M^*$ and $T(M^*)$ so that the decoder, taking as input any one
of SI frames $S^n$ and $M^*$, can always reconstruct $T(M^*)$ as
output, and where $M^*$ is an RD-optimal solution for a given
weight parameter $\lambda$, i.e.,

$$M^* = \arg\min_{M} \; D(T, T(M)) + \lambda R(M), \tag{2}$$

where $D(T, T(M))$ is the distortion incurred (with respect to
$T$) when choosing $T(M)$ as the common reconstructed frame,
and $R(M)$ is the rate needed to transmit $M$.

The second problem essentially states that the reconstruction
target $T(M)$ is RD-optimized with respect to desired target
$T$, while the first problem requires identical reconstruction
to desired target T. Note that in both problem formulations we 
avoid coding drift since they guarantee identical reconstruction 
for any SI frame, but a solution to Problem 2 will be shown 
to lead to significantly lower coding rates. 


C. Piecewise Constant Function for Signal Merging


A merge operation must, given q-coeff $X_b^n(k)$ of any SI
frame $S^n$, $n \in \{1, \ldots, N\}$, reconstruct an identical value
$X_b(k)$, for all frequencies $k$. We use a PWC function $f(x)$
as the chosen merging operator, with shift $c$ and step size
$W$ parameters selected for each frequency $k$ of each block
$b$ encoded in merge mode (see Fig. 1). The selection of these
parameters influences the RD performance of this merging
operation for the optimized target merging case. We now
focus our discussion on how $c$ and $W$ are selected for each
coefficient. Because the optimization is the same for each
frequency $k$, we will drop the frequency index $k$ for simplicity
of presentation.

Examples of PWC functions are ceiling, round,
floor, etc. In this paper, we employ the floor function^2:

$$f(x) = \left\lfloor \frac{x + c}{W} \right\rfloor W + \frac{W}{2} - c \tag{3}$$

^2 We define the floor function to minimize the maximum difference between
original $x$ and reconstructed $f(x)$, given shift $c$ and step size $W$.

From Fig. 1, it is clear that there are numerous combinations of
parameters $W$ and $c$ such that identical merging is ensured,
i.e., all $X_b^n$ map to the same constant interval. Note also that
the choice of $W$ depends on how spread out the various
$X_b^0, \ldots, X_b^N$ are, that is, how correlated the SI blocks are
to each other. In contrast, $c$ is used to select a desired
reconstruction value $X_b$. Thus, because the level of correlation
can be assumed to be relatively consistent across blocks, a step
size $W_{\mathcal{B}_M}$ is selected once for all blocks $b \in \mathcal{B}_M$ for a given
frequency. On the other hand, since the actual reconstruction
value will be different from block to block, the shift $c_b$ will
be selected on a per block basis for a given frequency.
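
To make the role of the two parameters concrete, the following minimal Python sketch evaluates the floor-type merge function of (3) on two SI q-coeffs. The function name `pwc_merge` and the numeric values are illustrative assumptions, not taken from the experiments.

```python
import math


def pwc_merge(x, c, w):
    """Floor-type PWC function of (3): floor((x + c) / W) * W + W/2 - c."""
    return math.floor((x + c) / w) * w + w / 2 - c


x1, x2 = 11, 13          # q-coeffs of the same block in SI frames 1 and 2
w = 6                    # step size (illustrative)

# Without a shift, the two inputs straddle an interval boundary at 12.
no_shift = (pwc_merge(x1, 0, w), pwc_merge(x2, 0, w))

# A shift of 2 moves both inputs into the same constant interval.
with_shift = (pwc_merge(x1, 2, w), pwc_merge(x2, 2, w))
```

With these values the unshifted function gives two different outputs, while the shifted one maps both inputs to a single reconstruction value, which is the behaviour depicted in Fig. 1.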

Before formulating the problem of optimizing the choice of 
c and W, we derive constraints under which this selection is 
made by determining: 

• The minimum value of $W$ that guarantees identical merging,

• The choice of $c$ that guarantees correct reconstruction,

• The effective range of $c$.

We first compute a minimum step size $W_{\mathcal{B}_M}$ to enable identical
merging for blocks $b$ in $\mathcal{B}_M$. Let $Z_b^*$ be the maximum pair
difference between any pair of q-coeffs of a given frequency
in block $b$, i.e.,

$$Z_b^* = X_b^{\max} - X_b^{\min}, \tag{4}$$

where $X_b^{\max}$ and $X_b^{\min}$ are respectively the maximum and
minimum q-coeffs among the SI frames, i.e.,

$$X_b^{\max} = \max_{n = 0, \ldots, N} X_b^n, \qquad X_b^{\min} = \min_{n = 0, \ldots, N} X_b^n. \tag{5}$$

Given $Z_b^*$, we next define the group-wise maximum pair
difference $Z^*_{\mathcal{B}_M}$ for the blocks in group $\mathcal{B}_M$:

$$Z^*_{\mathcal{B}_M} = \max_{b \in \mathcal{B}_M} Z_b^*. \tag{6}$$

Since all $X_b^n$ are integers, $Z^*_{\mathcal{B}_M}$ is also an integer. We can now
establish a minimum for step size $W_{\mathcal{B}_M}$ above which identical
merging for all blocks $b \in \mathcal{B}_M$ is achievable:


Fact 1. Minimum Step Size for Identical Merging: a step
size $W_{\mathcal{B}_M} > Z^*_{\mathcal{B}_M}$ is large enough for floor function $f(X_b^n)$
in (3) to merge any $X_b^n$ in $\mathcal{B}_M$ to a same value $X_b$.

Since each $S^n$ is a coarse approximation of (and thus is
similar to) desired target $T$, the $S^n$'s themselves are similar.
Hence, the largest difference $Z_b^*$ should be small in the typical
case. Indeed, we observe empirically that $Z_b^*$ follows an
exponential distribution (one-sided because $Z_b^*$ is non-negative).
Fig. 5 shows the $Z_b^*$ probability distribution for $k = 16$ and $k = 32$.
We can see that 80% of the blocks have $Z_b^* < 5$. Assuming
that $Z_b^*$ follows a Laplacian distribution, the maximum $Z^*_{\mathcal{B}_M}$ is
typically much larger than the average $Z_b^*$. This will be shown
to be useful for the optimized merging of Section VI.
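
The quantities in (4)-(6) and the bound of Fact 1 can be computed directly. The sketch below is an illustration under stated assumptions: the helper names and the sample q-coeffs are made up, and each inner list holds the values $X_b^0, \ldots, X_b^N$ of one block at a single frequency.

```python
def max_pair_difference(q_coeffs):
    """Z*_b of (4): spread between the largest and smallest q-coeff of a block."""
    return max(q_coeffs) - min(q_coeffs)


def min_step_size(blocks):
    """Smallest integer step size that exceeds the group-wise spread of (6)."""
    z_group = max(max_pair_difference(b) for b in blocks)
    return z_group + 1


# Made-up q-coeffs for three blocks in a merge group, one frequency.
blocks = [[4, 5, 4], [-2, 0, 1], [7, 7, 7]]

per_block = [max_pair_difference(b) for b in blocks]
w_min = min_step_size(blocks)
```

Here a single block with a large spread determines the step size of the whole group, which is why the group-wise maximum is typically much larger than the per-block values.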




Fig. 5. Two examples of probability distribution of $Z_b^*$ with three SI frames
at $Q = 1$ for Balloons at frequency $k = 16$ and $k = 32$.

Fact 1 states that step size $W_{\mathcal{B}_M}$ is wide enough so that
$X_b^0, \ldots, X_b^N$ can all fall on the same interval in $f(x)$, as
shown in Fig. 1. However, given $W_{\mathcal{B}_M}$, shift $c_b$ must still be
appropriately chosen per block to achieve identical merging.


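Mathematically, identical merging means that the floor
function with parameters $c_b$ and $W_{\mathcal{B}_M}$ produces the same
integer output for all inputs $X_b^n$, that is:

$$\left\lfloor \frac{X_b^n + c_b}{W_{\mathcal{B}_M}} \right\rfloor = \left\lfloor \frac{X_b^0 + c_b}{W_{\mathcal{B}_M}} \right\rfloor, \quad \forall n \in \{1, \ldots, N\}. \tag{7}$$

Thus for all $X_b^n$, we must have for some $m \in \mathbb{Z}$ that:

$$m W_{\mathcal{B}_M} \le X_b^n + c_b < (m+1) W_{\mathcal{B}_M}, \quad \forall n \in \{0, \ldots, N\}. \tag{8}$$

Instead of considering all $X_b^n$'s, it is sufficient to consider only
the maximum and minimum values, so that the maximum
range for $c_b$ that guarantees identical reconstruction is:

$$m W_{\mathcal{B}_M} - X_b^{\min} \le c_b < (m+1) W_{\mathcal{B}_M} - X_b^{\max} \tag{9}$$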

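for some integer $m$. Note that given step size $W_{\mathcal{B}_M}$, $c_b$ and
$c_b + m W_{\mathcal{B}_M}$ lead to the same output:

$$f(x) = \left\lfloor \frac{x + c_b + m W_{\mathcal{B}_M}}{W_{\mathcal{B}_M}} \right\rfloor W_{\mathcal{B}_M} + \frac{W_{\mathcal{B}_M}}{2} - (c_b + m W_{\mathcal{B}_M}) = \left\lfloor \frac{x + c_b}{W_{\mathcal{B}_M}} \right\rfloor W_{\mathcal{B}_M} + \frac{W_{\mathcal{B}_M}}{2} - c_b.$$

Thus it will be sufficient to consider at most $W_{\mathcal{B}_M}$ different values
of $c_b$ as possible candidates.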

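Define $\alpha = X_b^{\min} \bmod W_{\mathcal{B}_M}$ and $\beta = X_b^{\max} \bmod W_{\mathcal{B}_M}$, and
consider the two possible cases.

• In case (i), $X_b^{\min} = m W_{\mathcal{B}_M} + \alpha$ and $X_b^{\max} = m W_{\mathcal{B}_M} + \beta$,
where $\alpha < \beta$, so that $X_b^{\min}$ and $X_b^{\max}$ fall in the same
interval when there is no shift, $c_b = 0$. Hence we can
have $-\alpha \le c_b < W_{\mathcal{B}_M} - \beta$ in order to keep both $X_b^{\min}$ and
$X_b^{\max}$ in the interval $[m W_{\mathcal{B}_M}, (m+1) W_{\mathcal{B}_M})$.

• In case (ii), $X_b^{\min} = m W_{\mathcal{B}_M} + \alpha$ and $X_b^{\max} = (m+1) W_{\mathcal{B}_M} + \beta$,
where $\beta < \alpha$, i.e., when $c_b = 0$, $X_b^{\min}$ and $X_b^{\max}$ fall in
neighboring intervals. Here we can have $-\alpha \le c_b < -\beta$
to move $X_b^{\max}$ down to the interval $[m W_{\mathcal{B}_M}, (m+1) W_{\mathcal{B}_M})$,
or have $W_{\mathcal{B}_M} - \alpha \le c_b < W_{\mathcal{B}_M} - \beta$ to move $X_b^{\min}$ up to
the interval $[(m+1) W_{\mathcal{B}_M}, (m+2) W_{\mathcal{B}_M})$.

Note that the selection of $W_{\mathcal{B}_M}$ (Fact 1) implies that
$X_b^{\max} - X_b^{\min} < W_{\mathcal{B}_M}$, and $\alpha = \beta$ only if $X_b^{\max} = X_b^{\min}$, in which case
there is no merging needed and any $c_b$ would suffice.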



Fig. 6. Two cases of $X_b^{\min}$ and $X_b^{\max}$ (left: $\alpha < \beta$ and right: $\alpha > \beta$) and their
implications on the feasible range of shift $c_b$.


The two cases ($\alpha < \beta$ and $\alpha > \beta$) are illustrated in Fig. 6.
Note that given $X_b^{\max} \ge X_b^{\min}$ by definition, we will be in Case
(ii) whenever $\beta < \alpha$. Thus we can summarize this result as:

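Fact 2. Maximum Feasible Range $\mathcal{F}_b$ for Shift $c_b$: For the
shift $c_b$ to provide identical merging of q-coeffs $X_b^0, \ldots, X_b^N$ to
a same value given step size $W_{\mathcal{B}_M}$,

$$c_b \in \mathcal{F}_b = [-\alpha, \; W_{\mathcal{B}_M} - \beta) \quad \text{if } \alpha < \beta$$

and

$$c_b \in \mathcal{F}_b = [W_{\mathcal{B}_M} - \alpha, \; W_{\mathcal{B}_M} - \beta) \quad \text{if } \alpha > \beta$$

with $\alpha = X_b^{\min} \bmod W_{\mathcal{B}_M}$ and $\beta = X_b^{\max} \bmod W_{\mathcal{B}_M}$.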
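
Fact 2 translates into a short routine that returns the half-open feasible range for one block and one frequency. The sketch below is a hedged illustration: the function names and sample values are assumptions, and the degenerate case $\alpha = \beta$ (all q-coeffs equal) is folded into case (i) for convenience.

```python
def pwc_merge(x, c, w):
    """Floor-type merge function f(x) of (3)."""
    return ((x + c) // w) * w + w / 2 - c


def feasible_shift_range(q_coeffs, w):
    """Return (lo, hi) such that every shift c in [lo, hi) merges all q-coeffs.

    Follows Fact 2; requires max(q_coeffs) - min(q_coeffs) < w (Fact 1).
    """
    x_min, x_max = min(q_coeffs), max(q_coeffs)
    if x_max - x_min >= w:
        raise ValueError("step size too small for identical merging")
    alpha, beta = x_min % w, x_max % w   # Python's % is non-negative for w > 0
    if alpha <= beta:                    # same interval when c = 0 (case i)
        return -alpha, w - beta
    return w - alpha, w - beta           # neighboring intervals (case ii)


def merges_identically(q_coeffs, c, w):
    """True if all q-coeffs map to one output under shift c and step size w."""
    return len({pwc_merge(x, c, w) for x in q_coeffs}) == 1


range_i = feasible_shift_range([13, 14, 15], w=6)   # alpha = 1, beta = 3
range_ii = feasible_shift_range([5, 6, 7], w=6)     # alpha = 5, beta = 1
```

Because $c_b$ and $c_b + m W_{\mathcal{B}_M}$ are equivalent, only one period of the range needs to be returned; the helper `merges_identically` can be used to check any candidate shift by brute force.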


D. Formulation of Merge Frame RD-Optimization 

In order to formulate the PWC function parameter optimization
problem, we first define distortion, $d_b$, as the squared
difference between coefficient $Y_b^0$ of the desired target $T$ and
reconstructed coefficient $f(X_b^0)\,Q$:

$$d_b = |Y_b^0 - f(X_b^0)\,Q|^2. \tag{10}$$

Because shift $c_b$ will always be chosen within the feasible
range defined in Fact 2, all q-coeffs $X_b^n$ will map to the same
value $f(X_b^n)$, $\forall n \in \{0, \ldots, N\}$. Thus we only need to compute
the distortion for $f(X_b^0)$ in (10).

For the k-th q-coeff in block group $\mathcal{B}_M$, the encoder will
have to transmit to the decoder:

1) one step size $W_{\mathcal{B}_M}(k) > Z^*_{\mathcal{B}_M}(k)$ for each group $\mathcal{B}_M$;

2) one shift $c_b(k)$ for each block $b$ in group $\mathcal{B}_M$.

The cost of encoding a single $W_{\mathcal{B}_M}(k)$ for the k-th q-coeffs
in group $\mathcal{B}_M$ is small, while the cost of encoding $|\mathcal{B}_M|$ shifts
$c_b(k)$ for each of the k-th q-coeffs can be significant. Thus we
consider only the rate associated with $c_b(k)$ in our optimization.

Note that since the high-frequency DCT coefficients of a
given code block are very likely zero, we can insert an End of
Block (EOB) flag $E_b$ to signal that the remaining high-frequency
q-coeffs in block $b$ in a raster-scan order are 0. Effective use
of $E_b$ can reduce the amount of transmitted PWC function
parameters^3. In summary, we can define the RD optimized
target merging problem as:


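$$\min \sum_{b \in \mathcal{B}_M} D_b + \lambda R_b \quad \text{s.t.} \quad W_{\mathcal{B}_M}(k) > Z^*_{\mathcal{B}_M}(k), \;\; c_b(k) \in \mathcal{F}_b(k) \tag{11}$$

with distortion $D_b$ and rate $R_b$ for block $b$ calculated as:

$$D_b = \sum_{k=0}^{E_b} d_b(k) + \sum_{k=E_b+1}^{K-1} |Y_b^0(k)|^2, \qquad R_b = \sum_{k=0}^{E_b} R(c_b(k)),$$

where $d_b(k)$ is defined in (10) and $R(c_b(k))$ is the rate to
encode $c_b(k)$. We discuss how we tackle this optimization in
Section VI.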
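
As a rough illustration of how the per-block cost in (11) could be evaluated, the sketch below computes $D_b + \lambda R_b$ for one block. Several things here are assumptions rather than part of the formulation: the rate of a shift is approximated by the ideal code length $-\log_2 P(c)$ under a given probability table, coefficients after the EOB position are taken to be reconstructed as zero, and all names and numbers are hypothetical.

```python
import math


def block_rd_cost(y_target, x_rec, q, shifts, shift_probs, eob, lam):
    """Approximate D_b + lambda * R_b for one block.

    y_target:    unquantized target DCT coefficients Y_b^0(k)
    x_rec:       merged q-coeffs f(X_b^0(k)) for k <= eob
    q:           quantization step size Q
    shifts:      chosen shift c_b(k) for k <= eob
    shift_probs: per-frequency dict mapping shift value -> probability
    eob:         index E_b of the last coded coefficient
    lam:         Lagrangian weight
    """
    # Distortion of coded coefficients, as in (10).
    dist = sum((y_target[k] - x_rec[k] * q) ** 2 for k in range(eob + 1))
    # Coefficients past the EOB flag are assumed to be reconstructed as zero.
    dist += sum(y_target[k] ** 2 for k in range(eob + 1, len(y_target)))
    # Ideal entropy-code length of each transmitted shift (an assumption).
    rate = sum(-math.log2(shift_probs[k][shifts[k]]) for k in range(eob + 1))
    return dist + lam * rate


cost = block_rd_cost(
    y_target=[10.0, 3.0, 0.5],
    x_rec=[5, 1],
    q=2,
    shifts=[0, 1],
    shift_probs=[{0: 0.5}, {1: 0.25}],
    eob=1,
    lam=1.0,
)
```

A skewed probability table makes frequent shift values cheap, which is the lever that the optimized target merging of Section VI exploits.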


V. Fixed Target Merging 

In certain applications, such as the static view switching
scenario discussed in Section III and illustrated in Fig. 3, the
picture interactivity graph is cyclic, so that we may have that
$\Pi_p \to \Pi_q$ and $\Pi_q \to \Pi_p$. Because of this interdependency, one
cannot directly define a simple target merging optimization,
since optimizing the reconstruction for $\Pi_q$ would require first
fixing a representation (frame) for $\Pi_p$, but optimizing $\Pi_p$
would in turn require first fixing a representation for $\Pi_q$.
As a simple alternative we propose fixed target merging,
where the reconstruction target $T$ for each picture is chosen
independently from the SI frames. For example, $T$ can be the
I-frame of the target picture for a given QP.

^3 In the fixed target merging case, $E_b$ is inserted when the remaining
high-frequency q-coeffs of a block $b$ in target $T$ are exactly zero. In the optimized
target case, $E_b$ can be inserted in an RD-optimal manner on a per-block basis,
similar to what is done in coding standards such as H.264 [11].


A. Fixed Target Reconstruction using Merge Operator 

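We first show that given a target reconstruction value $a$ and
a step size $W$, we can always find a shift $c$ so that $f(x)$ in
(3) is such that $f(x) = a$ for all inputs $x$ in the interval
$[a - W/2, a + W/2)$. To see this, first write target reconstruction
value $a = a_1 W + a_2$, where $a_1$ and $a_2 = a \bmod W$ are integers
and $0 \le a_2 < W$. Similarly, we write input $x = a_1 W + x_2$,
where integer $x_2$ can be bounded:

$$\begin{aligned}
a - \frac{W}{2} &\le x < a + \frac{W}{2} \\
a_1 W + a_2 - \frac{W}{2} &\le a_1 W + x_2 < a_1 W + a_2 + \frac{W}{2} \\
a_2 - \frac{W}{2} &\le x_2 < a_2 + \frac{W}{2}
\end{aligned} \tag{12}$$

We now set $c = \frac{W}{2} - a_2$. We show that this ensures $f(x) = a$
for $x \in [a - W/2, a + W/2)$:

$$\begin{aligned}
f(x) &= \left\lfloor \frac{a_1 W + x_2 + \frac{W}{2} - a_2}{W} \right\rfloor W + \frac{W}{2} - \left( \frac{W}{2} - a_2 \right) \\
&= a_1 W + a_2 = a,
\end{aligned} \tag{13}$$

where the second line is true because $x_2 + \frac{W}{2} - a_2$ in the
numerator of the “round-down” operator argument can be
bounded in $[0, W)$ using (12):

$$\begin{aligned}
a_2 - \frac{W}{2} + \frac{W}{2} - a_2 &\le x_2 + \frac{W}{2} - a_2 < a_2 + \frac{W}{2} + \frac{W}{2} - a_2 \\
0 &\le x_2 + \frac{W}{2} - a_2 < W
\end{aligned} \tag{14}$$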

Next, recall from Section IV-A that we include the desired
target $T$ as the first SI frame $S^0$. For a given frequency of
a particular block $b$, we first compute the maximum target
difference $Z_b$ as the largest absolute difference between target
q-coeff $X_b^0$ and $X_b^n$ of any SI frame $S^n$, i.e.,

$$Z_b = \max_{n \in \{1, \ldots, N\}} |X_b^0 - X_b^n| \tag{15}$$

We can then choose the step size and shift according to the
following lemma.

Lemma V.1. Choosing step size $W^* = 2 Z_b + 2$ and shift
$c_b = W^*/2 - X_{b,2}^0$, where $X_{b,2}^0 = X_b^0 \bmod W^*$, guarantees that
$f(X_b^n) = X_b^0$, $\forall n \in \{0, \ldots, N\}$. Note that $W^*$ is an even number,
and $c_b$ is an integer as required.

Proof: Given shift $c_b = W^*/2 - X_{b,2}^0$, we know from (13) that
$X_b^n \in [X_b^0 - W^*/2, X_b^0 + W^*/2)$ implies $f(X_b^n) = X_b^0$, $\forall n \in \{0, \ldots, N\}$.
Defining step size $W^* = 2 Z_b + 2$ means the required interval
for $X_b^n$ can be rewritten as $[X_b^0 - Z_b - 1, X_b^0 + Z_b + 1)$. By the
definition of $Z_b$, we know $X_b^0 - Z_b \le X_b^n \le X_b^0 + Z_b$. Hence
the required interval for $X_b^n$ is met. ■

Note that we can achieve fixed target merging for a given $X_b^0$ as long as the step size is at least $W_b^{*}$. For example, we can assign the same step size $W^{*}_{\mathcal{B}_M}$ for all blocks in a group $\mathcal{B}_M$, so that we reduce the rate overhead:

$$W^{*}_{\mathcal{B}_M} = 2 + 2Z_{\mathcal{B}_M} \tag{16}$$

where $Z_{\mathcal{B}_M} = \max_{b \in \mathcal{B}_M} Z_b$ is the group-wise maximum target difference, and $Z_b$, the block-wise maximum target difference for block $b$, is computed using (15). In summary:

1) We define a set of blocks $\mathcal{B}_M$ and use $W^{*}_{\mathcal{B}_M}(k)$ computed using (16) for frequency $k$ of all blocks in $\mathcal{B}_M$.

2) For block $b$, we set shift $c_b(k) = W^{*}_{\mathcal{B}_M}(k)/2 - X^0_{b,2}(k)$, where $X^0_{b,2}(k) = X^0_b(k) \bmod W^{*}_{\mathcal{B}_M}(k)$. A different shift is used for each frequency $k$ and block $b$, and transmitted as part of the M-frame along with $W^{*}_{\mathcal{B}_M}(k)$.
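The two steps above can be sketched for a single frequency $k$ as follows. This is a minimal illustration under the same assumed merge function; the block values are hypothetical, and reducing each shift modulo the step size is a convenience of the sketch (the merge function is unchanged when $c$ moves by a multiple of $W$).

```python
def pwc_merge(x, W, c):
    """Assumed floor-based merge function f(x) = floor((x + c) / W) * W + W/2 - c."""
    return ((x + c) // W) * W + W // 2 - c


def group_parameters(targets, side_infos):
    """Shared step size for a block group and one shift per block.

    targets[b] is the target q-coeff of block b at one frequency, and
    side_infos[b] lists the corresponding SI q-coeffs of that block.
    """
    z_group = max(abs(t - s) for t, sis in zip(targets, side_infos) for s in sis)
    W = 2 + 2 * z_group                       # eq. (16)
    # f is unchanged when c shifts by a multiple of W, so reduce into [0, W).
    shifts = [(W // 2 - t % W) % W for t in targets]
    return W, shifts


targets = [23, -4, 100]                        # hypothetical values
side_infos = [[21, 26], [-4, -3], [95, 101]]
W, shifts = group_parameters(targets, side_infos)
```

With these values the group-wise maximum target difference is 5, so all three blocks share $W = 12$, and each block's SI values merge to its own target.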

VI. Optimized Target Merging 

We now propose a merging approach based on selecting $W_{\mathcal{B}_M}(k)$ and $c_b(k)$ so as to find a solution to the optimization problem described in Section IV-D, where we allow the reconstructed value to be different from $X_b^0(k)$.

If $W_{\mathcal{B}_M}$ is chosen large enough, i.e., $W_{\mathcal{B}_M} \ge 2 + 2Z_{\mathcal{B}_M}$, then we have shown (Lemma V.1) that one can select shift $c_b$ to reconstruct target q-coeff $X_b^0$ exactly. However, the shifts are a function of $X_{b,2}^0 = X_b^0 \bmod W_{\mathcal{B}_M}$ (Lemma V.1), and thus we can expect them to have a uniform distribution, which would mean that a rate on the order of $\log_2(W_{\mathcal{B}_M})$ would be required as overhead. In order to reduce this rate, we use two approaches: i) we allow $W_{\mathcal{B}_M}$ to be smaller than required by Lemma V.1, and ii) when multiple choices of $c_b$ provide identical reconstruction, we optimize this choice based on the criteria introduced in Section IV-D.

A. Selection of $W_{\mathcal{B}_M}$

Note, by definition of $Z^{*}_{\mathcal{B}_M}$ we are guaranteed that all $X_b^n$ can be within an interval of size $W_{\mathcal{B}_M}$ as long as $W_{\mathcal{B}_M} > Z^{*}_{\mathcal{B}_M}$, provided we transmit an appropriate $c_b$ (Fact 1). Reducing $W_{\mathcal{B}_M}$ from $2 + 2Z_{\mathcal{B}_M}$ can reduce the rate required to transmit $c_b$, since $c_b$ can take at most $W_{\mathcal{B}_M}$ different values.

As shown in Section IV-C, we observe empirically that $Z^{*}_b$ follows a Laplacian distribution (Fig. 5). Thus, for a large block group $\mathcal{B}_M$, $Z^{*}_{\mathcal{B}_M} = \max_{b \in \mathcal{B}_M} Z^{*}_b$ will in general be much larger than $Z^{*}_b$. Since $Z^{*}_{\mathcal{B}_M} \ge Z^{*}_b$, in practice for many blocks $b$ it is thus possible to reconstruct target $X_b^0$ since $W_{\mathcal{B}_M} \ge 2Z_b + 2$. Thus, we propose to select $W_{\mathcal{B}_M} = Z^{*}_{\mathcal{B}_M} + 1$, which guarantees that for the worst case block all SI values are in the same interval, with appropriate choice of $c_b$ to be discussed next.
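To make the effect of a smaller step size concrete, the following sketch enumerates, by brute force, the shifts for which all SI values of a block land on the same step. It is a stand-in for the feasible region of Fact 2, which is not reproduced here; the merge function is the assumed floor-based form, and the values are hypothetical.

```python
def pwc_merge(x, W, c):
    """Assumed floor-based merge function f(x) = floor((x + c) / W) * W + W/2 - c."""
    return ((x + c) // W) * W + W // 2 - c


def feasible_shifts(x_side, W):
    """Shifts c in [0, W) for which every SI value merges to the same output."""
    return [c for c in range(W)
            if len({pwc_merge(x, W, c) for x in x_side}) == 1]


x_side = [20, 25]      # hypothetical SI q-coeffs of one block (spread 5)
target = 22            # hypothetical target q-coeff, so Z_b = 3

wide = feasible_shifts(x_side, 10)    # W = 10 >= 2 * Z_b + 2 = 8
tight = feasible_shifts(x_side, 6)    # W = 6, one more than the SI spread
```

With $W = 10$ there are five feasible shifts, and $c = 3$ reproduces the target exactly. With $W = 6$ only $c = 4$ remains and the merged value is 23, so identical reconstruction is kept but one unit of distortion is incurred. For any step size not exceeding the spread of the SI values, no feasible shift exists.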

B. RD-optimal Selection of Shifts 

Given a chosen $W_{\mathcal{B}_M}$, according to Fact 2 there will be multiple values of $c_b$ that guarantee identical reconstruction for all $X_b^n$. To enable efficient entropy coding of $c_b$, it is desirable to have a skewed probability distribution $P(c_b)$ of $c_b$. We design an algorithm to promote a skewed $P(c_b)$ iteratively. We first propose how to initialize $P(c_b)$, and then discuss how to update $P(c_b)$ in subsequent iterations.

We optimize shift $c_b$ via the following RD cost function:

$$\min_{0 \le c_b < W_{\mathcal{B}_M}} \; d_b + \lambda \left( -\log P(c_b) \right), \tag{17}$$

where the rate term is approximated as the negative log of the probability $P(c_b)$ of candidate $c_b$, and $d_b$ is the distortion term computed using (10). The difficulty in using objective (17) to compute optimal $c_b^{*}$ lies in how to define $P(c_b)$ prior to selection of $c_b$. Our strategy is to initialize a skewed distribution $P(c_b)$ to promote a low coding rate, perform optimization (17) for each block $b \in \mathcal{B}_M$, then update $P(c_b)$ based on statistics of the selected $c_b$'s, and repeat until $P(c_b)$ converges.
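The alternation between shift selection and re-estimation of $P(c_b)$ can be sketched as below. This is a simplified illustration, not the procedure evaluated in the experiments: the distortion of (10) is replaced by an assumed squared error between the merged value and the target, feasibility is checked by brute force, the empirical distribution is smoothed so that every shift keeps nonzero probability, and the initial distribution is whatever the caller passes in.

```python
import math
from collections import Counter


def pwc_merge(x, W, c):
    """Assumed floor-based merge function f(x) = floor((x + c) / W) * W + W/2 - c."""
    return ((x + c) // W) * W + W // 2 - c


def best_shift(target, side, W, lam, prob):
    """Minimize d + lam * (-log2 P(c)) over shifts that merge all SI values."""
    best_c, best_cost = None, math.inf
    for c in range(W):
        recon = {pwc_merge(x, W, c) for x in side}
        if len(recon) != 1:            # SI values disagree: shift is infeasible
            continue
        d = (recon.pop() - target) ** 2          # assumed squared-error distortion
        cost = d + lam * -math.log2(prob[c])
        if cost < best_cost:
            best_c, best_cost = c, cost
    return best_c


def iterate_shifts(blocks, W, lam, prob, max_rounds=20, eps=1e-3):
    """Alternate shift selection and re-estimation of P(c) until it settles."""
    shifts = []
    for _ in range(max_rounds):
        shifts = [best_shift(t, s, W, lam, prob) for t, s in blocks]
        counts = Counter(shifts)
        # Additive smoothing keeps every shift codable (an assumption of this sketch).
        total = len(shifts) + eps * W
        new_prob = [(counts.get(c, 0) + eps) / total for c in range(W)]
        if max(abs(p - q) for p, q in zip(prob, new_prob)) < 1e-9:
            break
        prob = new_prob
    return shifts, prob


blocks = [(22, [20, 25]), (23, [21, 24]), (5, [4, 7])]   # hypothetical (target, SI) pairs
shifts, prob = iterate_shifts(blocks, 10, 0.0, [0.1] * 10)
```

With $\lambda = 0$ the selection reduces to the distortion-minimizing shift of each block; larger $\lambda$ values pull the selected shifts toward values that are already probable.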

In order to choose an initial distribution $P(c_b)$, we note that a distribution with a small number of spikes has lower entropy than a smooth distribution (see Fig. 7 as an example). Choosing $c_b$ values following such a discrete distribution (e.g., left in Fig. 7) means that we reduce the number of possible $c_b$, which may increase $d_b$. Thus, if $\lambda$ in (17) is small, in order to reduce distortion one can increase the number of spikes in $P(c_b)$. In this paper, we propose to induce a multi-spike probability $P(c_b)$, where the appropriate number of spikes depends on the desired tradeoff between distortion and rate in (17).




Fig. 7. Two examples of shift distribution $P(c_b)$. The left distribution has a small number of spikes and low entropy (1.22). The right distribution is smooth but has high entropy (4.38).


Since $c_b$ is constrained to be in the feasible region $\mathcal{F}_b$ defined in Fact 2, it is possible that when we restrict $c_b$ to just a few values as in Fig. 7 (left), there will be some blocks $b$ for which none of the “spikes” in $P(c_b)$ fall within their $\mathcal{F}_b$. In order to guarantee identical reconstruction they must be able to select non-spike values as shifts $c_b$. Thus we propose a “spike + uniform” distribution $P(c_b)$:

$$
P(c_b) =
\begin{cases}
p_h^{\circ} & \text{if } c_b = c_h^{\circ} \\
p_c & \text{o.w.}
\end{cases} \tag{18}
$$

where $\{c_1^{\circ}, \ldots, c_H^{\circ}\}$ are the $H$ spikes, each with probability $p_h^{\circ}$, and $p_c$ is a small constant for non-spike shift values. $p_c$ is chosen so that $P(c_b)$ sums to 1.
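A distribution of the form (18) is easy to construct and to compare against a uniform one in terms of entropy. The sketch below is illustrative only; the spike positions and probabilities are hypothetical, and it assumes that fewer than $W$ spikes are used so that some non-spike shifts remain.

```python
import math


def spike_uniform(W, spikes, spike_probs):
    """P(c) over shifts 0..W-1 in the form of (18).

    Each listed spike keeps its own probability; every other shift gets the
    same constant p_c, chosen so that the distribution sums to 1.
    """
    p_c = (1.0 - sum(spike_probs)) / (W - len(spikes))
    prob = [p_c] * W
    for c, p in zip(spikes, spike_probs):
        prob[c] = p
    return prob


def entropy_bits(prob):
    """Shannon entropy in bits; zero-probability entries contribute nothing."""
    return -sum(p * math.log2(p) for p in prob if p > 0)


spiky = spike_uniform(16, [2, 9], [0.6, 0.3])   # hypothetical two-spike example
uniform = [1.0 / 16] * 16
```

The uniform distribution over 16 shifts costs 4 bits per shift, whereas the two-spike distribution has markedly lower entropy while still assigning nonzero probability to every shift.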

1) Computing distribution $P(c_b)$ for fixed $H$: We now discuss how we compute $P(c_b)$ for given $H$. Empirically we observe that for a reasonable number of spikes (e.g., $H \ge 3$), the majority of blocks (typically 99% or more) in $\mathcal{B}_M$ have at least one spike in their feasible region $\mathcal{F}_b$. Thus, to simplify our computation we first ignore the feasibility constraint and employ an iterative rate-constrained Lloyd-Max algorithm (rc-LM) [29] to identify spike locations.

We illustrate the operations of rc-LM to initialize $H$ spike locations for $H = 3$ as follows. Let $c_b^0$ be the shift value that minimizes only distortion for block $b$. Let $g(c^0)$ be the probability distribution of distortion-minimizing shift $c_b^0$ for blocks in $\mathcal{B}_M$, where $0 \le c^0 < W_{\mathcal{B}_M}$. $g(c^0)$ can be computed empirically for group $\mathcal{B}_M$. Without loss of generality, we define quantization bins for the three spikes $c_1^{\circ}$, $c_2^{\circ}$ and $c_3^{\circ}$ as $[0, b_1)$, $[b_1, b_2)$ and $[b_2, W_{\mathcal{B}_M})$ respectively. The expected distortion $D(\{c_h^{\circ}\})$ given three spikes is:

$$
D(\{c_h^{\circ}\}) = \sum_{c^0=0}^{b_1-1} |c^0 - c_1^{\circ}|^2 g(c^0) + \sum_{c^0=b_1}^{b_2-1} |c^0 - c_2^{\circ}|^2 g(c^0) + \sum_{c^0=b_2}^{W_{\mathcal{B}_M}-1} |c^0 - c_3^{\circ}|^2 g(c^0) \tag{19}
$$

where $D(\{c_h^{\circ}\})$ is computed as the sum of squared difference between $c^0$ and spike $c_h^{\circ}$ in the bin that $c^0$ is assigned to. Having defined distortion $D(\{c_h^{\circ}\})$, the initial spike locations $c_h^{\circ}$ given $H$ spikes can be found as follows: i) construct $H$ spikes evenly spaced in the interval $[0, W_{\mathcal{B}_M})$; ii) use the conventional Lloyd-Max algorithm with no rate constraints to converge to a set of $H$ bin centroids $c_h^{\circ}$.
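The initialization steps i) and ii) can be sketched with a plain Lloyd-Max iteration on a discrete histogram. This covers only the rate-unconstrained initialization; placing the initial spikes at the centers of $H$ equal-width cells and the histogram values are assumptions of the sketch.

```python
import math


def lloyd_max(g, H, max_iter=100):
    """Plain (rate-unconstrained) Lloyd-Max on a histogram g over shifts 0..W-1.

    Returns H spike locations (bin centroids) and the bin boundaries
    [b_0 = 0, b_1, ..., b_H = W]; bin h covers shifts bounds[h] .. bounds[h+1] - 1.
    """
    W = len(g)
    # i) Spikes evenly spaced in [0, W): centers of H equal-width cells.
    spikes = [(h + 0.5) * W / H for h in range(H)]
    bounds = [0] + [0] * (H - 1) + [W]
    for _ in range(max_iter):
        # Nearest-spike rule: a boundary is the first integer at or above the midpoint.
        inner = [math.ceil((spikes[h] + spikes[h + 1]) / 2) for h in range(H - 1)]
        new_bounds = [0] + inner + [W]
        # Centroid rule: probability-weighted mean of the shifts in each bin.
        new_spikes = []
        for h in range(H):
            lo, hi = new_bounds[h], new_bounds[h + 1]
            mass = sum(g[lo:hi])
            if mass > 0:
                new_spikes.append(sum(c * g[c] for c in range(lo, hi)) / mass)
            else:
                new_spikes.append(spikes[h])   # keep an empty bin's spike in place
        if new_bounds == bounds and new_spikes == spikes:
            break
        bounds, spikes = new_bounds, new_spikes
    return spikes, bounds


# Hypothetical histogram of distortion-minimizing shifts with three clusters.
g = [0.2, 0.1, 0.0, 0.0, 0.0, 0.3, 0.1, 0.0, 0.0, 0.0, 0.1, 0.2]
spikes, bounds = lloyd_max(g, H=3)
```

The returned centroids are generally not integers; rounding them to the nearest integer shift before use would be one further assumption.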

Next, adding consideration for rate, the RD cost of the three spikes can then be written as:

$$
D(\{c_h^{\circ}\}) + \lambda \left[ -\log\Big(\sum_{c^0=0}^{b_1-1} g(c^0)\Big) - \log\Big(\sum_{c^0=b_1}^{b_2-1} g(c^0)\Big) - \log\Big(\sum_{c^0=b_2}^{W_{\mathcal{B}_M}-1} g(c^0)\Big) \right] \tag{20}
$$

(20) is essentially the aggregate of RD costs (17) for all blocks in $\mathcal{B}_M$.

To minimize (20), rc-LM alternately optimizes bin boundaries $b_i$ and spike locations $c_i^{\circ}$, one at a time, until convergence. Given that the spikes $c_i^{\circ}$ are fixed, each bin boundary $b_i$ is optimized via exhaustive search in the range $[c_i^{\circ}, c_{i+1}^{\circ})$ to minimize both rate and distortion in (20). Given that the bin boundaries $b_i$ are fixed, the optimal $c_i^{\circ}$ can be computed simply as the bin average:

$$
c_i^{\circ} = \frac{\sum_{c^0=b_{i-1}}^{b_i-1} g(c^0)\, c^0}{\sum_{c^0=b_{i-1}}^{b_i-1} g(c^0)} \tag{21}
$$

where $b_0 = 0$ and $b_H = W_{\mathcal{B}_M}$.

Upon convergence, we can then identify the small fraction of blocks with no spikes in their feasible regions $\mathcal{F}_b$ and assign an appropriate constant $p_c$ so that $P(c_b)$ is well defined according to (18). Computing $P(c_b)$ with $H$ spikes where $H \ne 3$ can be done similarly.

2) Finding the optimal $P(c_b)$: To find the optimal $P(c_b)$, we add an outer loop to this $P(c_b)$ construction procedure to search for the optimal number of spikes $H$. Pseudo-code of the complete algorithm is shown in Algorithm 1. We note that in practice, the number of iterations until convergence is small.


C. Comparison with Coset Coding 

We now discuss the similarity between our proposed approaches and coset coding methods in DSC [9]. Consider first fixed target merging of one q-coeff of a single block $b$. In a scalar implementation of coset coding, given possible SI values

Algorithm 1 Computing the optimal shift distribution $P(c_b)$
for each number of spikes $H \in [1, W_{\mathcal{B}_M}]$ do
    Initialize distribution $P^0(c_b)$ via LM;
    $t = 0$;
    repeat
        $t = t + 1$;
        Update $H$ spike locations $c_h^{\circ}$ via (21);
        Update bin boundaries $b_i$ by minimizing (20);
        Compute $p_c$ for a new $P^t(c_b)$;
    until $\|P^{t-1}(c_b) - P^t(c_b)\| < \epsilon$
end for


$X_b^n$, $n \in \{1,\ldots,N\}$, seen as “noisy” versions of a target $X_b^0$, the largest difference $Z_b = \max_n |X_b^n - X_b^0|$ with respect to $X_b^0$ is first computed. The size of the coset $W$ is then selected such that $W > 2Z_b$. The coset index $i_b = X_b^0 \bmod W$ is computed at the encoder for transmission.

At the decoder, the reconstructed value $\hat{X}_b$ is the integer closest to received SI $X_b^n$ with the same coset index $i_b$, i.e.,

$$\hat{X}_b = \arg\min_{X \in \mathbb{Z}} |X_b^n - X| \quad \text{s.t.} \quad i_b = X \bmod W \tag{22}$$

Using the aforementioned coset coding scheme for blocks $b \in \mathcal{B}_M$, coding of $i_b = X_b^0 \bmod W = X_{b,2}^0$ per block is necessary, where coset size $W$ is chosen such that $W > 2Z_{\mathcal{B}_M}$. In our fixed target merging scheme using PWC functions, we code a shift $c_b = W^{*}_{\mathcal{B}_M}/2 - X_{b,2}^0$ for each block $b$, where step size $W^{*}_{\mathcal{B}_M}$ is also proportional to $2Z_{\mathcal{B}_M}$. Comparing the two schemes, one can see that the number of choices that need to be sent to the decoder is the same (one of about $2Z_{\mathcal{B}_M}$ possible values in both cases). Both the shift value $c_b$ and $i_b$ are functions of $X_{b,2}^0$, the LSBs of $X_b^0$, which are likely to have an approximately uniform distribution. Thus the overhead rate should be the same for both coset coding and fixed target merging.

Consider now the optimized merging case. In this scenario we are able to choose $W_{\mathcal{B}_M} = Z^{*}_{\mathcal{B}_M} + 1$, which is likely much smaller than $2Z_{\mathcal{B}_M} \le 2Z^{*}_{\mathcal{B}_M}$, so that we can still guarantee identical reconstruction, with a reduction in rate that comes at the cost of an increase in distortion. As for the coset coding approach, if we were to choose a smaller $W$ as well, we in fact can no longer guarantee identical reconstruction. This is because when $W < 2Z_{\mathcal{B}_M}$ there will be cases where not all the $X_b^n$ are in the same interval, and thus the same $i_b$ will lead to two different values at the decoder depending on the SI received. This imperfect merging will lead to undesirable coding drift in the following predicted frames, as discussed in Section III.
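The contrast between the two schemes can be illustrated numerically. The sketch below implements the scalar coset decoder of (22) and the assumed floor-based merge function; the q-coeff values are hypothetical, and ties in the coset decoder are broken toward the smaller candidate, which is a choice of this sketch.

```python
def coset_decode(x_side: int, index: int, W: int) -> int:
    """Integer closest to x_side whose residue modulo W equals index, as in (22).

    Ties are broken toward the smaller candidate (an assumption of this sketch).
    """
    below = x_side - ((x_side - index) % W)   # largest coset member <= x_side
    above = below + W
    return below if x_side - below <= above - x_side else above


def pwc_merge(x, W, c):
    """Assumed floor-based merge function f(x) = floor((x + c) / W) * W + W/2 - c."""
    return ((x + c) // W) * W + W // 2 - c


target, side = 23, [21, 26]                # hypothetical values, Z_b = 3
ok = [coset_decode(x, target % 8, 8) for x in side]     # W = 8 > 2 * Z_b
bad = [coset_decode(x, target % 4, 4) for x in side]    # W = 4 < 2 * Z_b
pwc = [pwc_merge(x, 6, 3) for x in side]                # W = 6, shift c = 3
```

With $W = 8$ both SI values decode to the target 23. With $W = 4$ the two SI values decode to different integers, which is the drift problem described above. The PWC function with $W = 6$ and $c = 3$ still maps both SI values to the same output (24), at the cost of one unit of distortion relative to the target.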


VII. Experiments 

We first discuss the general experimental setup and M-frame parameter selection (Section VII-A). We then verify the effectiveness of our proposed “Spike + Uniform” distribution (Section VII-B). Next, we compare the performance of our M-frame in three different situations: 1) static view switching (Scenario 1 in Section VII-C); 2) switching among streams of different rates for the same single-view video (Scenario 2 in Section VII-D); and 3) dynamic view switching of multi-view videos of different viewpoints encoded at the same bit-rate (Scenario 3 in Section VII-E).

A. Experimental Setup 

We use four different multiview video test sequences with resolution 1024×768 for scenarios 1 and 3: Balloons, Kendo^4, Lovebird1 and Newspaper^5. The viewpoints of each sequence are shown in Table II. For scenario 2, we use four single-view video sequences with resolution 1920×1080: BasketballDrive, Cactus, Kimono1 and ParkScene^6.


TABLE II
Viewpoints of each multiview sequence.

Sequence Name    Viewpoints
Balloons         1, 3, 5
Kendo            1, 3, 5
Lovebird1        4, 6, 8
Newspaper        3, 4, 5


We compare the coding performance of our proposed scheme against two schemes^7: SP-frame [10] in H.264 and D-frame proposed in [30]. QP for D-frame is set to be equal to $QP_{SI}$ to maintain consistent quality. For multi-view scenarios 1 and 3, we encoded three streams from three viewpoints: the center view was set as the target, to which the other two side views can switch at a defined switching point. For Scenario 2, we encoded the single-view video at three different bit-rates and then switched among them. The bit-rates for the three streams were decided according to additive increase multiplicative decrease (AIMD) rate control behavior in TCP and TFRC [31]: one stream has twice the target stream’s bit-rate, while the other has a slightly smaller bit-rate (0.9 times the target stream’s bit-rate). The results are shown in plots of PSNR versus coding rate for a switched frame.

M-frame parameters are selected as follows. In Scenario 1, different $QP_M$ will result in different rates, and so we set $QP_M$ equal to $QP_{SI}$, as was done for D-frames. However, for optimized target merging, coding rate is determined mainly by the number of spikes in the distribution, and not $QP_M$. In our experiments, as similarly done in High Efficiency Video Coding (HEVC), we first empirically compute $\lambda$ as a function of the SI frame’s $QP_{SI}$:

A = (23) 

The number of spikes in the distribution is driven by the selected $\lambda$. We then set $QP_M = 1$ to maintain small quantization error. For mode selection among skip, intra and merge, for each block $b$ we first examine q-coeffs $X_b^n(k)$ of the $N$ SI frames. If $X_b^n(k)$ of all $K$ frequencies are identical across the SI frames, then block $b$ is coded as skip. Otherwise, selection between intra and merge is done based on an RD criterion.

^4 http://www.tanimoto.nuee.nagoya-u.ac.jp/mpeg/mpeg_ftv.html

^5 ftp://203.253.128.142

^6 ftp://ftp.tnt.uni-hannover.de/testsequences/

^7 Here $QP_A$ denotes the quantization parameter for coding DCT coefficients in approach A.
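The mode decision described in this subsection (skip when all SI frames agree, otherwise intra versus merge by RD cost) can be sketched as follows. The two cost arguments are hypothetical precomputed RD costs of the form $D + \lambda R$; how they are obtained is outside the scope of this sketch.

```python
def is_skip_block(si_coeffs):
    """si_coeffs[n][k] is the q-coeff of frequency k in SI frame n for one block."""
    return all(frame == si_coeffs[0] for frame in si_coeffs[1:])


def choose_mode(si_coeffs, cost_intra, cost_merge):
    """Skip when all SI frames agree; otherwise take the cheaper RD cost.

    cost_intra and cost_merge are hypothetical precomputed RD costs (D + lambda * R).
    """
    if is_skip_block(si_coeffs):
        return "skip"
    return "intra" if cost_intra < cost_merge else "merge"
```

For example, a block whose q-coeffs are `[3, 0, 1]` in every SI frame is coded as skip regardless of the two costs.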

In HEVC, large coding block sizes are introduced, which bring significant coding gain on high-resolution sequences [32]. Motivated by this observation, we also investigated the effect of different block sizes (4×4, 8×8, 16×16) on coding performance. We also compare our current proposal against the performance of our previous work [8], where the block size is fixed at 8×8, the initial probability distribution of shift $P(c_b)$ is not optimized, and no RD-optimized EOB flag is employed. The corresponding PSNR-bitrate curves for scenario 3 are shown in Fig. 8.



Fig. 8. PSNR vs. encoding rate comparison with different block sizes for sequences Balloons, Kendo, Lovebird1 and Newspaper. Panels: (a) Balloons, (b) Kendo, (c) Lovebird1, (d) Newspaper; horizontal axis: kBits/Frame.


From Fig. 8, we observe that block size 16×16 provides the best coding performance at all bit-rates. One reason for the superior performance of large blocks in M-frame is the following: because SI frames are already reconstructions of the target frames (albeit slightly different), motion compensation is not necessary, so the benefit of smaller blocks typical in video coding is diminished. We note that in general an optimal block size per frame can be selected by the encoder a priori and encoded as side information to inform the decoder. In the following experiments, the block size is fixed at 16×16 for best performance.

Further, we also observe that our proposed method achieves a significant coding performance gain compared to our previous method in [8] over all bit-rate regions, showing the effectiveness of our newly proposed optimization techniques.

B. Effectiveness of “Spike + Uniform” Distribution

In order to verify the effectiveness of our proposed “Spike + Uniform” (SpU) probability distribution $P(c_b)$ for shift parameter $c_b$, we choose a competing naive distribution for $P(c_b)$ as follows: first, we compute the distribution $g(c^0)$ of distortion-minimizing shifts as the initial probability distribution. Next, we compute the RD-optimal $c_b$ for each block $b \in \mathcal{B}_M$ via (17) for a single iteration using the initialized probability distribution and compute a new $P'(c_b)$. This $P'(c_b)$ is then used to compute the rate to encode each $c_b$ of a merge block $b$. The difference between $P'(c_b)$ and our proposed $P^t(c_b)$ is that $P'(c_b)$ in general is an arbitrarily shaped distribution, not a skewed “spiky” distribution. Experimental results of M-frame using these distributions are shown in Fig. 9.




Fig. 9. PSNR vs. encoding rate comparison with different block sizes for sequences Balloons, Kendo. Panels: (a) Balloons, (b) Kendo, (c) Lovebird1, Newspaper; horizontal axis: kBits/Frame.

We observe from Fig. 9 that our proposed SpU distribution outperforms the naive distribution in the high bit-rate region and is comparable in the low bit-rate region. This is because in the low bit-rate region $\lambda$ is very large, so that for any initial distribution, after one iteration, only one spike remains, and the number of iterations required for convergence is very small.

C. Scenario 1: Static View Switching 

We first test our proposed M-frame in the static view switching scenario for multi-view sequences. Three views are encoded using the same QP. The fixed target merging algorithm described in Section V is used to facilitate switching to neighboring views among pictures of the same instant, as shown in Fig. 3.

Specifically, we constructed M- / D-frames to enable static view-switching from view 1 or 3 to target view 2. We first use H.264 to encode two SI frames (P-frames) using $\Pi_{2,2}$ as the target and $\Pi_{1,2}$ and $\Pi_{3,2}$ as predictors, respectively. This results in encoded rates $R_{1,2}$ and $R_{3,2}$ for the two SI frames, respectively. Then we encoded an M- / D-frame to merge these two SI frames identically to $\Pi_{2,2}$. The corresponding rates for M-frame and D-frame are $R^{M}_{2,2}$ and $R^{D}_{2,2}$, respectively. Since SP-frame in H.264 cannot perform fixed target merging, it is not tested in this scenario.

We assume that the switching probability is equal for views 1 and 3, i.e., 0.5 each. Then the overall rate for the D-frame is calculated as:

$$R^{D}_{T} = \frac{R_{1,2} + R_{3,2}}{2} + R^{D}_{2,2}. \tag{24}$$

Also, the overall rate for our proposed M-frame using fixed target merging scheme is calculated as:

$$R^{M}_{T} = \frac{R_{1,2} + R_{3,2}}{2} + R^{M}_{2,2}. \tag{25}$$

The coding results are shown in Fig. 10 and BD-rate [33] comparison can be found in Table III. We observe from Table III that our proposed M-frame using fixed target merging scheme achieved up to 40.1% BD-rate reduction compared to D-frame. Further, from Fig. 10 we observe that our M-frame is better than D-frame in all bit-rate regions, especially in the low and high bit-rate regions, mainly due to the skip block and EOB flag tools. In the high bit-rate region, due to the small distortion in SI frames, more blocks are classified as skip blocks, which efficiently reduces the bits needed to encode the M-frame, while in the low bit-rate region more coefficients are set to zero and skipped due to the EOB flag. This shows the effectiveness of our proposed M-frame using fixed target merging scheme compared to the D-frame.

Fig. 10. PSNR vs. encoding rate comparing proposed M-frame using fixed target merging scheme with D-frame for sequences Balloons, Kendo, Lovebird1 and Newspaper in static view switching scenario.

TABLE III
BD-rate reduction of proposed M-frame using fixed target merging scheme compared to D-frame in static view switching scenario.

Sequence Name    M-frame vs. D-frame
Balloons         -31.7%
Kendo            -40.1%
Lovebird1        -35.7%
Newspaper        -31.1%

D. Scenario 2: Bit-rate Adaptation

We next conducted experiments on the bit-rate adaptation scenario for single-view video sequences. The M-frame is encoded in an RD-optimized manner, as described in Section VI, with the system framework shown in Fig. 2. Three streams of different rates are encoded according to AIMD rate control behavior.

We constructed M- / D-frames to enable stream-switching from stream 1, 2 or 3 to target stream 2 under different bit-rates. We first encode three SI frames using $\Pi_{2,2}$ as target and $\Pi_{1,1}$, $\Pi_{2,1}$ and $\Pi_{3,1}$ as reference, respectively. This results in encoded rates $R_{1,1}$, $R_{2,1}$ and $R_{3,1}$ for the three SI frames, respectively. Then we encoded an M- / D-frame to merge these three SI frames into an identical frame. The corresponding rates for M-frame and D-frame are $R^{M}_{2,2}$ and $R^{D}_{2,2}$, respectively.

We also constructed SP-frames to enable stream-switching from stream 1, 2 or 3 to target stream 2. We first encoded a primary SP-frame using $\Pi_{2,2}$ as target and $\Pi_{2,1}$ as reference. We then losslessly encoded two secondary SP-frames using the primary SP-frame as target and $\Pi_{1,1}$, $\Pi_{3,1}$ as reference, respectively. $R^{SP}_{2,1}$ denotes the rate for the primary SP-frame, while $R^{SP}_{1,1}$ and $R^{SP}_{3,1}$ denote the rates for the two secondary SP-frames.

As a measure of transmission rate, we consider both the average and worst-case code rate during a stream-switch. For the average case, in the absence of application-dependent information, we assume that the probability of stream-switching is equal for all views. Thus, the overall rate for RD-optimized M-frame is calculated as:

$$R^{M}_{T_a} = \frac{R_{1,1} + R_{2,1} + R_{3,1}}{3} + R^{M}_{2,2}. \tag{26}$$

The overall rate for D-frame is calculated as:

$$R^{D}_{T_a} = \frac{R_{1,1} + R_{2,1} + R_{3,1}}{3} + R^{D}_{2,2}. \tag{27}$$

The overall rate for SP-frame is calculated as:

$$R^{SP}_{T_a} = \frac{R^{SP}_{1,1} + R^{SP}_{2,1} + R^{SP}_{3,1}}{3}. \tag{28}$$

The coding results of the average case are shown in Fig. 11 and BD-rate comparison can be found in Table IV. We observe from Table IV that our proposed RD-optimized M-frame achieves up to 65.6% BD-rate reduction compared to D-frame and 36.3% BD-rate reduction compared to SP-frame. Moreover, from Fig. 11 we observe that our proposed RD-optimized M-frame is better than D-frame and SP-frame in all bit-rate regions. Note that for the SP-frame case, if the switching probability to the primary SP-frame is higher, it will result in a smaller average rate.

Fig. 11. PSNR versus encoding rate comparing proposed RD-optimized M-frame with D-frame and SP-frame for sequences BasketballDrive, Cactus, Kimono1 and ParkScene in average case. Panels: (a) BasketballDrive, (b) Cactus, (c) Kimono1, (d) ParkScene; horizontal axis: kBits/Frame; vertical axis: PSNR.

For the worst case, the code rate for M-frame is calculated as:

$$R^{M}_{T_w} = \max(R_{1,1}, R_{2,1}, R_{3,1}) + R^{M}_{2,2}. \tag{29}$$

The rate for D-frame is calculated as:

$$R^{D}_{T_w} = \max(R_{1,1}, R_{2,1}, R_{3,1}) + R^{D}_{2,2}. \tag{30}$$

The rate for SP-frame is calculated as:

$$R^{SP}_{T_w} = \max(R^{SP}_{1,1}, R^{SP}_{2,1}, R^{SP}_{3,1}). \tag{31}$$

The coding results of the worst case are shown in Fig. 12 and BD-rate comparison can be found in Table IV. We observe from Table IV that our proposed RD-optimized M-frame achieves up to 65.4% BD-rate reduction compared to D-frame and 49.9% BD-rate reduction compared to SP-frame.

We observe in Table IV that the performance difference between average and worst case for D-frame is small. However, for SP-frame the performance difference between average and worst case is large. This is due to lossless coding in secondary SP-frames, resulting in a much larger size than the primary SP-frame (typically 10 times larger).

Fig. 12. PSNR versus encoding rate comparing RD-optimized M-frame with D-frame and SP-frame for sequences BasketballDrive, Cactus, Kimono1 and ParkScene in worst case. Panels: (a) BasketballDrive, (b) Cactus, (c) Kimono1, (d) ParkScene; horizontal axis: kBits/Frame; vertical axis: PSNR.


E. Scenario 3: Dynamic View Switching 

Finally, we conducted experiments on the dynamic view switching scenario for multiview video sequences. Three views are encoded using the same QP. The detailed frame structures for M-frame, D-frame and SP-frame are the same as in Section VII-D. The overall rate calculations for the average and worst cases are also identical.

The coding results of dynamic view switching for the average case and worst case are shown in Figs. 13 and 14, respectively.









































IEEE TRANSACTIONS ON IMAGE PROCESSING, SEPTEMBER 2015 




TABLE IV
BD-rate reduction of RD-optimized M-frame compared to D-frame and SP-frame of scenario 2.

                 M-frame vs. D-frame           M-frame vs. SP-frame
Sequence Name    Average Case   Worst Case     Average Case   Worst Case
Balloons         -63.4%         -63.7%         -17.0%         -39.4%
Kendo            -63.5%         -63.2%         -18.8%         -42.1%
Lovebird1        -65.6%         -65.4%         -36.3%         -49.9%
Newspaper        -56.3%         -56.7%         -19.5%         -43.8%


TABLE V
BD-rate reduction of RD-optimized M-frame compared to D-frame and SP-frame of scenario 3.

                 M-frame vs. D-frame           M-frame vs. SP-frame
Sequence Name    Average Case   Worst Case     Average Case   Worst Case
Balloons         -55.1%         -53.0%         -19.2%         -35.0%
Kendo            -53.8%         -53.6%         -19.3%         -36.4%
Lovebird1        -57.5%         -58.7%         -11.3%         -28.7%
Newspaper        -51.6%         -50.4%         -5.0%          -12.9%



Fig. 13. PSNR versus encoding rate comparing proposed RD-optimized M-frame with D-frame and SP-frame for sequences Balloons, Kendo, Lovebird1 and Newspaper in average case. Panels: (a) Balloons, (b) Kendo, (c) Lovebird1, (d) Newspaper.

Fig. 14. PSNR versus encoding rate comparing proposed M-frame with D-frame and SP-frame for sequences Balloons, Kendo, Lovebird1 and Newspaper in worst case. Panels: (a) Balloons, (b) Kendo, (c) Lovebird1, (d) Newspaper.


BD-rate comparison for average case and worst case can be found in Table V. From Table V we observe that, in the average case, our proposed RD-optimized M-frame achieves up to 57.5% BD-rate reduction compared to D-frame and up to 19.3% BD-rate reduction compared to SP-frame. In the worst case, it achieves up to 58.7% BD-rate reduction compared to D-frame and up to 36.4% BD-rate reduction compared to SP-frame.

VIII. Conclusion 

In this paper, we propose a new merging operator, the piecewise constant (PWC) function, for merging different reconstructed versions of a target frame into a unique one, to enable stream switching while preserving coding efficiency. Specifically, in order to merge k-th transform coefficients of different side information (SI) frames to the same value, we encode appropriate step sizes and horizontal shift parameters of a floor function, so that all the SI coefficients fall on the same function step. We propose two methods to select floor function parameters for signal merging. In the first method, we select parameters so that coefficients are merged identically to a pre-determined target value. In the second method, the merged target value can be RD-optimized to induce better coding performance. Experimental results show that for both cases, our proposed merge frame has significant coding gain over an implementation of DSC frame and H.264 SP-frames with a reduction in decoder complexity.